
fix(retention-job): add chunking strategy for cleanup#4305

Merged
TheodoreSpeaks merged 2 commits into staging from fix/cleanup-log on Apr 27, 2026
Conversation

@TheodoreSpeaks
Collaborator

Summary

Add a chunking cleanup strategy to avoid failures when the number of cleanup items exceeds the maximum number of parameters allowed in a SQL query.
Unified the logic into a single helper method.
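The core idea described above is splitting a large ID list into fixed-size chunks so that no single query's `IN (...)` list exceeds the database's parameter limit. A minimal sketch of such a helper, assuming the shape of `chunkArray` mentioned later in the review (the exact signature in the PR may differ):

```typescript
// Illustrative sketch: split an array into fixed-size chunks so each
// SQL query only receives a bounded number of bind parameters.
// The name `chunkArray` appears in the review; this signature is an assumption.
function chunkArray<T>(items: T[], size: number): T[][] {
  const chunks: T[][] = []
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size))
  }
  return chunks
}
```

Each chunk is then used to build one bounded `IN (...)` query instead of a single oversized one.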

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation
  • Other: ___________

Testing

  • Tested locally

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

Screenshots/Videos

@vercel

vercel Bot commented Apr 27, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project | Deployment | Updated (UTC)
docs | Skipped | Apr 27, 2026 7:02pm

Request Review

@TheodoreSpeaks
Collaborator Author

@BugBot review

@cursor

cursor Bot commented Apr 27, 2026

PR Summary

Medium Risk
Touches recurring retention cleanup jobs and the shared delete helper; incorrect chunking/limits could leave data unpruned or over-delete if predicates are wrong, though changes are mostly mechanical and bounded by existing retention filters.

Overview
Improves retention cleanup reliability at scale by chunking large workspace/ID lists before running SELECT/DELETE loops, avoiding oversized IN (...) queries and statement timeouts.

Introduces selectRowsByIdChunks and a generalized chunkedBatchDelete helper (with per-chunk batching, an optional per-batch side-effect hook, and an overall row cap), and updates the cleanup-logs, cleanup-soft-deletes, and cleanup-tasks jobs to use these chunked primitives; cleanup-logs also consolidates workflow-log deletion + file cleanup via the new hook and switches job log deletion to the shared wrapper.
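The per-chunk loop the summary describes (SELECT a batch, run an optional side-effect hook, DELETE by ID, repeat until drained or capped) can be sketched as follows. This is an illustration only: the callback shapes, option names, and counter semantics are assumptions, not the PR's real `chunkedBatchDelete` API, and the database calls are abstracted behind injected functions.

```typescript
// Sketch of a chunked batch-delete loop: for each chunk of workspace IDs,
// repeatedly SELECT up to batchSize row IDs, run an optional per-batch hook
// (e.g. deleting associated files), then DELETE those rows, until the chunk
// is drained or maxBatches is reached. All identifiers here are illustrative.
interface ChunkedDeleteOptions {
  chunkSize: number // workspace IDs per IN (...) list
  batchSize: number // rows selected/deleted per query
  maxBatches: number // cap on batches per chunk
  selectBatch: (workspaceIds: string[], limit: number) => Promise<string[]>
  deleteByIds: (rowIds: string[]) => Promise<void>
  onBatch?: (rowIds: string[]) => Promise<void> // optional side effect
}

async function chunkedBatchDeleteSketch(
  workspaceIds: string[],
  opts: ChunkedDeleteOptions
): Promise<{ deleted: number; failed: number }> {
  let deleted = 0
  let failed = 0
  for (let i = 0; i < workspaceIds.length; i += opts.chunkSize) {
    const chunk = workspaceIds.slice(i, i + opts.chunkSize)
    try {
      for (let batch = 0; batch < opts.maxBatches; batch++) {
        const rows = await opts.selectBatch(chunk, opts.batchSize)
        if (rows.length === 0) break
        if (opts.onBatch) await opts.onBatch(rows)
        await opts.deleteByIds(rows)
        deleted += rows.length
        if (rows.length < opts.batchSize) break // chunk drained
      }
    } catch {
      failed++ // counted per failed chunk, per the review's observation
    }
  }
  return { deleted, failed }
}
```

Note the implicit cap: each chunk can do up to `maxBatches × batchSize` rows, so total per-run work scales with the number of chunks, which is the behavior the Greptile review flags below.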

Reviewed by Cursor Bugbot for commit 3c48b24. Bugbot is set up for automated code reviews on this repo.


@cursor Bot left a comment


✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit 66a66d5.

@TheodoreSpeaks TheodoreSpeaks marked this pull request as ready for review April 27, 2026 18:41
@greptile-apps
Contributor

greptile-apps Bot commented Apr 27, 2026

Greptile Summary

This PR refactors the retention cleanup jobs to chunk large workspace-ID IN lists before issuing SELECT/DELETE queries, avoiding Postgres statement timeouts caused by probing every workspace range in the composite index. The duplicated per-table loop logic is consolidated into two shared helpers — chunkedBatchDelete (SELECT → optional side-effect → DELETE by ID) and selectRowsByWorkspaceChunks (SELECT-only, with a row cap).

Confidence Score: 4/5

Safe to merge — the fix addresses a real production failure with correct logic; the two P2 items are about failure-counter semantics and an uncapped per-run row budget that should be validated before deploying at scale.

No P0/P1 bugs found. Two P2 issues: result.failed semantics silently changed from row count to batch count (affects monitoring readability), and chunkedBatchDelete no longer has a fixed global row cap (could dramatically increase per-run work for large free-tier installs). All P2s, so ceiling is 4/5.

apps/sim/lib/cleanup/batch-delete.ts — the new chunkedBatchDelete carries the changed failed semantics and unbounded-per-chunk scaling; worth a close look before deploying to a large free-tier instance.

Important Files Changed

Filename Overview
apps/sim/lib/cleanup/batch-delete.ts Core library rewrite: adds chunkArray, selectRowsByWorkspaceChunks, and chunkedBatchDelete; the failed counter semantics silently changed from row count to batch count, and the effective per-run row cap now scales with workspace count rather than being fixed at 20 000.
apps/sim/background/cleanup-logs.ts Refactored to use chunkedBatchDelete for workflow execution logs (with S3 side-effects) and batchDeleteByWorkspaceAndTimestamp for job execution logs; logic is correct but file-delete stats are now only logged when filesTotal > 0, slightly reducing observability.
apps/sim/background/cleanup-soft-deletes.ts Workspace file / workflow selects now use selectRowsByWorkspaceChunks; copilotChats sub-select (line 208) is unchanged and still builds a potentially large IN list from doomedWorkflowIds.
apps/sim/background/cleanup-tasks.ts Straightforward migration: copilotChats and copilotRuns selects updated to use selectRowsByWorkspaceChunks; no issues found.

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[workspaceIds] --> B[chunkArray\n50 IDs per chunk]
    B --> C{for each chunk}

    C --> D[selectChunk\nSELECT up to batchSize rows]
    D --> E{rows.length == 0?}
    E -- yes --> F[next chunk]
    E -- no --> G[onBatch?\ne.g. delete S3 files]
    G --> H[DELETE rows by ID]
    H --> I{rows.length == batchSize\nAND batches < maxBatches?}
    I -- yes --> D
    I -- no --> F
    F --> C

    D -. error .-> ERR[result.failed++\nskip to next chunk]
    G -. error .-> ERR
    H -. error .-> ERR

    C -- all chunks done --> LOG[Log: total deleted\nchunk failures]
```

Comments Outside Diff (1)

  1. apps/sim/background/cleanup-soft-deletes.ts, line 208-213 (link)

    P2 Potentially large IN clause not chunked

    doomedWorkflowIds is selected via selectRowsByWorkspaceChunks which caps at DEFAULT_BATCH_SIZE × DEFAULT_MAX_BATCHES_PER_TABLE = 20 000. That entire list is then passed to inArray(copilotChats.workflowId, doomedWorkflowIds) in a single query — potentially 20 000 bind parameters with no chunking. This is the same class of problem this PR fixes for workspace-ID IN lists. While this code predates the PR, the PR's own comment explains why large IN lists cause Postgres to blow the 90 s statement timeout, so this may need the same treatment.
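A hypothetical fix for this P2, mirroring the workspace-ID chunking the PR adds: chunk `doomedWorkflowIds` before building each `IN (...)` list. The helper name and the abstracted query callback below are illustrative, not code from the PR.

```typescript
// Illustrative sketch: run a per-chunk query over a potentially huge ID list
// so no single query carries up to 20,000 bind parameters. The actual query
// (e.g. Drizzle's inArray on copilotChats.workflowId) is abstracted as a callback.
async function selectByIdChunksSketch<T>(
  ids: string[],
  chunkSize: number,
  queryChunk: (chunk: string[]) => Promise<T[]>
): Promise<T[]> {
  const results: T[] = []
  for (let i = 0; i < ids.length; i += chunkSize) {
    const chunk = ids.slice(i, i + chunkSize)
    results.push(...(await queryChunk(chunk)))
  }
  return results
}
```

Each call to `queryChunk` would then issue a bounded `inArray(...)` query, keeping every statement well under the parameter limit and timeout budget.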

Reviews (1): Last reviewed commit: "fix(retention-job): add chunking strateg..."

Comment thread apps/sim/lib/cleanup/batch-delete.ts
@TheodoreSpeaks TheodoreSpeaks merged commit 65e17de into staging Apr 27, 2026
14 checks passed
@TheodoreSpeaks TheodoreSpeaks deleted the fix/cleanup-log branch April 27, 2026 19:13